These are the first experiments where I use version 2 chemistry. Read 2 is now the barcode and UMI. Read 1 is the cDNA. This was done to avoid reading through the TSO on Ilumina read 1
This sequencing run consists of 3 experiments:
Check the samples cluster by their cell type
This was generated in notebook 1A_generateSCE.
Library size normalization and transformation
set.seed(666)
lib.sf <- librarySizeFactors(sce)
sce <- computeSumFactors(sce)
sce <- logNormCounts(sce)Here we follow the example of https://bioconductor.org/books/release/OSCA/feature-selection.html#variance-of-the-log-counts
dec.sce <- modelGeneVar(sce)
fit.sce <- metadata(dec.sce)
hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
sce <- scater::runPCA(sce, subset_row=hvg.sce.var, ncomponents=10)Visulaise the fit. According to OSCA book:
At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component.
Therefore the efitted line represents technical variation of which is much higher in version 1 where I used the read object compared to version 2 where the UMI object is used.
We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection.
Build the object
Make the plot of technical and biological variation.
plt1 <- ggplot(tb,
aes(x = Mean, y= Variance)) +
geom_point(alpha = 0.2, size=0.5) +
guides(colour = guide_legend(override.aes = list(size=2, alpha=1))) +
xlab("Mean of log-expression") +
ylab("Variance of log-expression") +
scale_colour_brewer(type="qualitative", palette = "Dark2") +
geom_function(fun = fit.sce$trend, colour = "darkgreen") +
theme_Publication(base_size = 16)
plt1The coloured line represents technical variation
This includes my samples too. Separation is very clearly human and mouse.
This distort the PCA and should be removed anyway when I do DE
tb <- as_tibble(colData(sce))
plt1 <- ggplot(tb,
aes(x = sum, y= detected, colour=Storage)) +
geom_point(size=1.5) +
xlab("UMI counts") +
ylab("Genes detected") +
scale_x_continuous(trans='log10') +
annotation_logticks(base = 10, sides = "b") +
ylim(0,20000) +
scale_colour_brewer(type="qualitative", palette = "Dark2") +
theme_Publication(base_size = 16)Remove the low quality samples. Focus on the brain cancer samples that were stored frozen.
sce <- sce[,sce$Researcher == "ZM"]
sce <- sce[,sce$Storage == "Freezer"]
sce <- sce[,sce$sum > 60000]tb <- as_tibble(colData(sce))
plt2 <- ggplot(tb,
aes(x = sum, y= detected, colour=Storage)) +
geom_point(size=1.5) +
xlab("UMI counts") +
ylab("Genes detected") +
scale_x_continuous(trans='log10') +
annotation_logticks(base = 10, sides = "b") +
ylim(0,20000) +
scale_colour_brewer(type="qualitative", palette = "Dark2") +
theme_Publication(base_size = 16)
plt1 / plt2Library size normalization and transformation
set.seed(666)
lib.sf <- librarySizeFactors(sce)
sce <- computeSumFactors(sce)
sce <- logNormCounts(sce)Feature selection
There is a lot more variation than the example at https://bioconductor.org/books/release/OSCA/feature-selection.html#variance-of-the-log-counts
Choose 1000 genes instead of standard 1000 because of the presence of human and mouse genes.
dec.sce <- modelGeneVar(sce)
fit.sce <- metadata(dec.sce)
hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
sce <- runPCA(sce, subset_row=hvg.sce.var, ncomponents=10)Visulaise the fit. According to OSCA book:
At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component.
Therefore the efitted line represents technical variation of which is much higher in version 1 where I used the read object compared to version 2 where the UMI object is used.
We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection.
Build the object
plt1 <- ggplot(tb,
aes(x = Mean, y= Variance)) +
geom_point(alpha = 0.2, size=0.5) +
guides(colour = guide_legend(override.aes = list(size=2, alpha=1))) +
xlab("Mean of log-expression") +
ylab("Variance of log-expression") +
scale_colour_brewer(type="qualitative", palette = "Dark2") +
geom_function(fun = fit.sce$trend, colour = "darkgreen") +
theme_Publication(base_size = 16)
plt1The coloured line represents technical variation
Convert some drug metadata to factors
sce$Dose_M <- as.numeric(sce$Dose_M)
sce$Day_Exposure <- as.factor(sce$Day_Exposure)
# Reorder the levels
sce$Drug <- factor(sce$Drug, levels = c("TMZ", "DMSO"))
print(sce$Drug)## [1] TMZ TMZ TMZ TMZ TMZ TMZ TMZ DMSO TMZ TMZ TMZ TMZ DMSO TMZ TMZ
## [16] TMZ TMZ TMZ DMSO TMZ TMZ TMZ TMZ TMZ TMZ TMZ DMSO TMZ TMZ TMZ
## [31] TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ TMZ
## [46] TMZ TMZ DMSO DMSO TMZ TMZ TMZ TMZ DMSO TMZ TMZ TMZ TMZ TMZ TMZ
## [61] DMSO TMZ TMZ TMZ TMZ
## Levels: TMZ DMSO
The dose is related to the number of genes detected
plt1 <- plotPCA(sce, colour_by="Drug") +
theme_Publication()
plt2 <- plotPCA(sce, colour_by="Dose_M", shape_by="Day_Exposure", point_size=2.5) +
theme_Publication()
plt3 <- plotPCA(sce, colour_by="Day_Exposure", shape_by="Drug", point_size=2.5) +
theme_Publication()
plt4 <- plotPCA(sce, colour_by="detected") +
theme_Publication()View timepoint and Drug. Set the Drug as a a letter.
pca_tb <- as_tibble(reducedDim(sce))
pca_tb$Day_Exposure <- sce$Day_Exposure
pca_tb$Drug <- substr(x = sce$Drug, start = 1, stop = 1)
ggplot(pca_tb, aes(x=PC1, y=PC2, colour=Day_Exposure, label=Drug)) +
geom_text(size=4.5) +
xlab("PC1 (35%)") + ylab("PC2 (5%)") +
scale_colour_brewer(palette = "Dark2", name = "Days") +
theme_Publication()View library size and Drug. Set the Drug as a a shape.
Some relationship with library size but I don’t think it explains everything
pca_tb <- as_tibble(reducedDim(sce))
pca_tb$sum <- log(sce$sum+1)
pca_tb$Drug <- sce$Drug
ggplot(pca_tb, aes(x=PC1, y=PC2, colour=sum, shape=Drug)) +
geom_point(size=3) +
xlab("PC1 (35%)") + ylab("PC2 (5%)") +
theme_Publication()Conventional visualisation
View dose and timepoint
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 9.4 (Plow)
##
## Matrix products: default
## BLAS: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRblas.so
## LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRlapack.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggthemes_5.1.0 here_1.0.1
## [3] knitr_1.48 patchwork_1.3.0
## [5] scater_1.32.1 scran_1.32.0
## [7] scuttle_1.14.0 lubridate_1.9.3
## [9] forcats_1.0.0 stringr_1.5.1
## [11] dplyr_1.1.4 purrr_1.0.2
## [13] readr_2.1.5 tidyr_1.3.1
## [15] tibble_3.2.1 ggplot2_3.5.1
## [17] tidyverse_2.0.0 SingleCellExperiment_1.26.0
## [19] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [21] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
## [23] IRanges_2.38.1 S4Vectors_0.42.1
## [25] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
## [27] matrixStats_1.4.1
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.4
## [3] magrittr_2.0.3 compiler_4.4.1
## [5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
## [7] pkgconfig_2.0.3 crayon_1.5.3
## [9] fastmap_1.2.0 XVector_0.44.0
## [11] labeling_0.4.3 utf8_1.2.4
## [13] rmarkdown_2.28 tzdb_0.4.0
## [15] UCSC.utils_1.0.0 ggbeeswarm_0.7.2
## [17] xfun_0.48 bluster_1.14.0
## [19] zlibbioc_1.50.0 cachem_1.1.0
## [21] beachmat_2.20.0 jsonlite_1.8.9
## [23] highr_0.11 DelayedArray_0.30.1
## [25] BiocParallel_1.38.0 irlba_2.3.5.1
## [27] parallel_4.4.1 cluster_2.1.6
## [29] R6_2.5.1 RColorBrewer_1.1-3
## [31] bslib_0.8.0 stringi_1.8.4
## [33] limma_3.60.6 jquerylib_0.1.4
## [35] Rcpp_1.0.13 Matrix_1.7-0
## [37] igraph_2.0.3 timechange_0.3.0
## [39] tidyselect_1.2.1 rstudioapi_0.17.0
## [41] abind_1.4-8 yaml_2.3.10
## [43] viridis_0.6.5 codetools_0.2-20
## [45] lattice_0.22-6 withr_3.0.1
## [47] evaluate_1.0.1 pillar_1.9.0
## [49] generics_0.1.3 rprojroot_2.0.4
## [51] hms_1.1.3 sparseMatrixStats_1.16.0
## [53] munsell_0.5.1 scales_1.3.0
## [55] glue_1.8.0 metapod_1.12.0
## [57] tools_4.4.1 BiocNeighbors_1.22.0
## [59] ScaledMatrix_1.12.0 locfit_1.5-9.10
## [61] cowplot_1.1.3 edgeR_4.2.2
## [63] colorspace_2.1-1 GenomeInfoDbData_1.2.12
## [65] beeswarm_0.4.0 BiocSingular_1.20.0
## [67] vipor_0.4.7 cli_3.6.3
## [69] rsvd_1.0.5 fansi_1.0.6
## [71] S4Arrays_1.4.1 viridisLite_0.4.2
## [73] gtable_0.3.5 sass_0.4.9
## [75] digest_0.6.37 SparseArray_1.4.8
## [77] ggrepel_0.9.6 dqrng_0.4.1
## [79] farver_2.1.2 htmltools_0.5.8.1
## [81] lifecycle_1.0.4 httr_1.4.7
## [83] statmod_1.5.0